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Abstract 

We present a new application for keyword 
search within relational databases, which uses 
a novel algorithm to solve the join discovery 
problem by finding Mcmex-like trails through 
the graph of foreign key dependencies. It dif- 
fers from previous efforts in the algorithms 
used, in the presentation mechanism and in 
the use of primary-key only database queries 
at query-time to maintain a fast response for 
users. We present examples using the DBLP 
data set. 

Keywords: Relational Databases, Hidden 
Web, Search, Navigation, Memex, Trails, Db- 
Surfer, Join Discovery, XML 

1 Introduction 

"Future users of large data banks must be 
protected from having to know how the data 
is organized in the machine (the internal rep- 
resentation)." 

E. F. Codd 

We consider that for many users of modern systems, 
being protected from the internal structures of point- 
ers and hashes is insufficient. They also need to be 
spared the requirement of knowing the logical struc- 
tures of a company or of its databases. For example, 
customers searching for information on a particular 
product should not be expected to know the address 
at which the relevant data is held. But neither should 
they be expected to know part numbers or table names 
in order to access this data, as required when using 
SQL. 

Much of today's corporate data resides in relational 
databases, comprising a large chunk of what is known 
as the "hidden" or "deep" web. The word "hidden" 



means that, from a practical point of view, this data 
is hidden from conventional search engines; the word 
"deep" is intended for greater accuracy, meaning that 
the data can only be accessed through a specialised 
query interface. It is estimated that the deep web 
contains 500 times more information than is visible 
to conventional search engines [Hj. 

One way for users to access data in the deep web is 
through a site-specific search engine, such as the query 
interface at Amazon.com One can imagine that Ama- 
zon have a relational database storing all their cata- 
logue information, over which the full-text query facil- 
ity was developed. Research shows that users actively 
use such interfaces and expect major web sites to sup- 
port unstructured search facilities These inter- 
faces are more natural that the SQL syntax supported 
directly by the database. However, the full-text search 
will result in a loss of expressiveness relative to the full 
expressive power of SQL, which is an issue that we will 
partially explore. One can argue that many end users 
do not need access to the full expressive power of SQL. 
Studies of keyword-based search engines on the web 
have shown that users type short queries, rarely use 
advanced features and are typically bad at query re- 
formulation (331201 EH OH]. It is likely that profiles for 
users of a database search facilities will reveal similar 
behaviour. 

Vannevar Bush's seminal 1945 paper "As We May 
Think" first suggested the concept of a trail as a 
sequence of connected pages, with a future machine 
called Memex which would help the user build a "web 
of trails" [H]. The concept of trails is well established 
in the hypertext community . These sequences of 
pages have also been referred to as tours or paths and 
several hypertext systems have allowed for their con- 
struction, but no previous system has allowed the au- 
tomated construction of these trails or allowed the con- 
struction of trails across tables in relational databases. 

We have previously developed a tool which automates 
trail discovery for providing users with navigational 



assistance and search facilities whilst surfing a web 
site. We first introduced the system in and intro- 
duced a new graph-based interface alongside the work 
with automated Javadoc documentation in |4H . The 
navigation engine works by finding trails - sequences 
of linked pages, which are relevant to the user query. 
These trails are presented to the user in a tree-like 
structure with which they can interact. User studies 
have shown the value of providing contextual informa- 
tion in our combined search and navigation interface 
|25| . In a serious of tasks related to the UCL web site, 
users found the information they were looking in less 
time, with fewer clicks, and with a higher degree of 
satisfaction compared with using the Google index or 
UCL's own Compass system (which has subsequently 
been replaced). 

Building on this work with hyperlinked web pages, 
we have developed a tool called DbSurfer which pro- 
vides an interface for extracting data from relational 
databases. This data is extracted in the form of an 
inverted index and a graph, which can together be 
used to construct trails of information, allowing free 
text search on the contents. The free text search and 
database navigation facilities can be used directly, or 
can be used as the foundation for a customized inter- 
face. We hope that the trail structure and interface 
provided will provide the same benefits for users of 
the database search and assisted navigation facilities 
as for users of the web site interface. 

The rest of the paper is organized as follows. In sec- 
tion 121 we describe our methods of indexing the con- 
tent of a relational database for keyword search. In 
sectionOwe describe the algorithms for extending this 
to compute joins by building trails. Section 0] gives 
an overview of the system's architecture. Section [S] 
discusses the work done to incorporate XML index- 
ing into the system and section El discusses how this 
compliments the preceeding work to provide expres- 
sive queries for solving user's information needs. Sec- 
tion gives examples of this technique using DBLP 
data. Section |H| gives an overview of preliminary work 
into the evaluation of DbSurfer and related systems. 
Section discusses related work and in section 1101 we 
discuss directions for future research. 

2 Indexing a Relational Database 

2.1 Prom Relations to a Full- Text Index 

A single relation (or table) is a set of rows each of 
which can be addressed by some primary key. To in- 
dex these rows we extract the data from each row in 
turn and construct a virtual document or web page, 
which is indexed by our parser. This parser will recog- 
nize web content and handle document formats such 
as Postscript, PDF, Microsoft Office, Shockwave Flash 



and RPM package formats which may be stored as 
binary objects in a database, but over which indexes 
may never be created. The textual content of this doc- 
ument is extracted and stored in an inverted file |15| . 
During the parsing stage, URLs arc retrieved which 
reference other web pages. These URLs may be sent 
to our crawler and the pages added to the same index. 
The inverted file is indexed such that the posting lists 
contain normalized tf.idf entries as prescibed in \IVA\ 
although many variations are possible 0]. 

Whilst these virtual documents are transient and ex- 
ist only for the time it takes to be indexed, the en- 
tries in the posting lists provide references to a servlet 
which will reproduce a customized page for each row 
entry. This is achieved by extracting the data, convert- 
ing it to XML using a SAX generator, and applying 
an XSLT stylesheet to the resulting page ^Hl • Binary 
data is handled with a separate servlet accessed via 
links from these pages. The data for these pages is 
always accessed via a primary key, so the page display 
is almost instantaneous. This is essential for provid- 
ing the quick responses that users insist on |27| . It is 
a practical impossibility to guarantee response times 
on large databases when queries may contain full ta- 
ble scans and much work goes into avoiding them in 
traditional e-commerce systems |14| . 

The primary key may not be a convenient index to 
embed in a url format. For example, it may be a com- 
posite key with a large number of attributes or even 
a binary object (BLOB). To cover these possibilities 
and make the system robust we create a second iden- 
tifier which identifies this key, giving a two step lookup 
process. This index is held externally to the database. 
Oracle databases contain a unique rowid for each table 
which we can index, saving us from this two-stage pro- 
cess. Similar optimizations exist for other databases, 
but these have yet to be fully exploited. 



2.2 Generating the Link Graph 

Answers to user's queries may not be contained in a 
single table. Often the results are spread over several 
tables which must be joined together. We can answer 
such queries with the help of a link graph. We have 
shown how we can create an inverted file containing 
URLs, some of which reference traditional web pages 
and some of which reference servlets which return cus- 
tomized views of database content. All these URLs 
arc assigned a separate 32-bit number which identifies 
them. It is these numbers which arc stored in the in- 
verted file, and it is these numbers which are stored in 
the link graph. 

The link graph is constructed by examining the foreign 
key constraint of the database (either by accessing the 
data dictionary table or via the JDBC APIs) and the 



data entries themselves. Each matching set of {table, 
attribute) pairs where there is a recognized referential 
constraint generates a bi-directional link. Each row 
entry is converted to a URL and the indexes for these 
URLs are added to the link graph. The set of links be- 
tween web pages and between database rows and web 
pages is also added to this graph. The approach is 
equivalent to the Linkl database presented in j29| and 
the same techniques for improving the memory usage 
characteristics should work equally well in this case. 
These techniques are not used in our DBLP demo 
as the DBLP example is sufficiently small to be eas- 
ily contained without compression, and the increased 
query time due to the cost of compression would be an 
unneccessary sacrifice. The strength of this approach 
is that it allows transparent access to the database in 
a manner which is compatible with access to any other 
web page on the web site and for relational data to be 
joined with relevant web data. 

3 Computing Joins with Trails 

Given the graph of related elements, we can utilise 
our navigation engine approach to construct join se- 
quences as trails. The navigation engine works in 4 
stages. The first stage is to calculate scores for each of 
the nodes matching one or more of the keywords in the 
query, and isolate a small number of these for future 
expansion. The second stage is to construct the trails 
using the Best Trail algorithm ^U]. The third stage 
involves filtering the trails to remove redundant infor- 
mation. In the fourth and final stage, the navigation 
engine computes small summaries of each page or row 
and formats the results for display in a web browser. 

Each node (whether database row or web page) is 
scored using Salton's normalized tf.idf metric al- 
though other IR metrics can be used. Selection of the 
starting points is done by combining the tf.idf scores 
with a node ranking metric called potential gain, which 
rates the navigation potential of a node in a graph 
based upon the number of trails available from it. 

The best trail algorithm takes as input the set of start- 
ing nodes and builds a set of navigation trees, using 
each starting point as the root node. Two series of it- 
erations are employed for each tree using two different 
methods of probabilistic node selection. Once a sufh- 
cient number of nodes have been expanded, the high- 
est ranked trail from each tree is selected. The subse- 
quent set of trails is then filtered and sorted. Figure^ 
shows the algorithm in more detail. In this figure, S 
represents the set of starting points, M represents an 
optional number of repetitions to be performed, reduc- 
ing the element of chance in the calculation and 

^expand 

and Iconverge control the number of expansion and con- 
vergence iterations. D represents the navigation tree, 
which grows according to the average cardinality of 



the records in the database. The maximum size of D 
is fixed and adding nodes within _D is a trivial opera- 
tion. A single leaf node, t, of D, referred to as a tip, is 
selected during each iteration, p is a function from the 
set of trails to the set of real numbers, used to assign 
scores to the trails for selection. Two functions have 
been chosen specifically to allow a 0{Log{\D\)) selec- 
tion time. The chosen tip is expanded and the linked 
nodes are assigned new tips in D. After the expan- 
sion and convergence iterations have been completed 
the highest ranked trail from each expanded starting 
point is selected by the function 6est() and the result- 
ing trail is added to the set of candidate trails, B. df 
is a discrimination factor which speeds up the conver- 
gence process and forces behaviour closer to that of 
a best-first approach. With appropriate choice of pa- 
rameters {I expand = 0, df ^ 0), the best trail algorithm 
can emulate the simpler best-first algorithm. 

Algorithm 1 (Best.Trai^S", M, p)) 
1. begin 



2. for each u E S 

3. for i to M do 

4. {u}; 

5. for j = to I explore do 

6. t ^ select{D, p); 

7. D ^ expand{D,t); 

8. end for 

9. for j = to I converge do 

10. t ^ select{D, p,df,j); 

11. D ^ expand{D,t); 

12. end for 

13. B ^ B [J {best{D)} 

14. end for 

15. end for each 

16. return B 



17. end. 



Figure 1: The Best Trail Algorithm. 

The trails are scored according to two simple metrics: 
the sum of the unique scores of the nodes in the trail 
divided by the length plus a constant, and the weighted 
sum of node scores, where weights are determined by 
the position in the trail, and the number of repeti- 
tions of that node. Nodes which occur early in the 
trail receive a higher weight, whilst nodes which occur 
later or are repeated receive a lower weighting. These 
functions encourage non-trivial trails, whilst discour- 
aging redundant nodes. Two navigation trees are con- 
structed from each node, one for each of these func- 
tions. All trail ranking is done by comparing firstly 
the number of keywords matched in a trail, secondly, 
the greatest number of keywords matched by any given 
node in the trail and finally, the trail score. 

Filtering takes place using a greedy algorithm and re- 



moves any sequences of redundant nodes which may be 
present in the trail. Redundant nodes are nodes which 
are either deemed to be of no relevance to the query 
or replicate exactly content found in other nodes. It 
should be noted that this concept can easily be ex- 
tended to include removal of near-duplicates 

Once they have been filtered and sorted, the trails are 
returned to the user and presented in our NavSearch 
interface, the two main elements of which arc a navi- 
gation tool bar comprising of a sequence of URLs (the 
"best trail") and a navigation tree window with the 
details of all the trails. The content of any row can be 
examined by clicking on any likely looking entry or by 
examining the summary data in the enhanced tooltips. 

4 Architecture 

Conventional web search engines usually use an archi- 
tecture pattern comprising three components - a robot 
or crawler, an indcxer and a query engine |28[ 171 1^ . 
We also follow this design but augment the informa- 
tion retrieval engine with our trail finding system. In 
addition, we augment the crawler with the database 
indexer described above. A key difference betweeen 
the DbSurfer and a conventional search engine is that a 
search engine traditionally returns links to pages which 
are logically and physically separated from the pages of 
the servers performing the query operations, whereas 
the links returned by the DbSurfer refer mostly to the 
row display servlet we have described. 

Figure [21 shows the detailed architecture. The data 
from the database is retrieved by the DbReader when 
the index is built and by the display servlet when ex- 
amining the constructed trails. 

The database indexer (or reader) works by connecting 
to the database, selecting all the accessible tables and 
views available, and asking the administrator which 
of these should be indexed. The program will then 
extract the referential constraints for all of the selected 
tables and build a lookup table. This is kept separate 
from the main index and used by both the indexer and 
the display servlet. 

5 Semi-Structured Data and XML 

A relational database can be viewed as a special case 
of a more general model of semistructurcd data and 
XML m. Hence it might not be suprising that we 
can handle XML data using DbSurfer. Indeed that is 
all DbSurfer docs! The virtual documents alluded to 
in section 121 arc XML representations of relational tu- 
ples. Figure 121 shows an example of this from a row in 
the DBLP schema discussed in section (3 The super- 
fluous row element has been added for compatibility 
with the emerging SQL/XML standard ^Jj. We note 



that the proposed standard includes generation of an 
XML Schema which we neither construct nor require 
at present. 

Attribute names are also indexed as individual key- 
words so that a query "Anatomy of a search engine 
author" should return trails from the Anatomy paper 
to the entries for Sergey Brin and Larry Page. XML 
documents discovered on web sites are automatically 
recognized as such and can be indexed in the same 
way, as can XML documents stored in the database, 
thus increasing coverage. 

6 Query Expressiveness 

We have extended the search engine style query syntax 
to support an attribute container operation using the 
"=" sign. The construct x = y means that an attribute 
y must be contained in an XML tag x. For example, 
the query "Simon" might return publications relating 
to Simon's probabilistic model as well as articles by au- 
thors named Simon. The query author=simon would 
restrict the returned entries to those contained in an 
XML attribute (author), which translates to those in 
the author table, i.e. publications written by au- 
thors named Simon. This is achieved by indexing at- 
tribute, value pairs in the inverted file. The approach 
is expensive in its use of disk space but retains fast 
access. The search engine query operations such as 
+, - and link: still remain supported with this ex- 
tension. Thus a query "Computers -type=phdthcsis - 
type=mastersthesis" would return books, journals and 
articles on Computers, but no theses. This syntax does 
require some knowledge of either table or attribute 
names, but exists as an option to allow those with 
such knowledge to gain greater control. 

This means we can provide trails which answer dis- 
junctive queries (the default), with preference for re- 
sults containing as many keywords as possible (con- 
junctive). We can also force the return of trails 
containing only specific keywords or which exclude 
certain keywords. We can also use the attribute 
syntax to provide more complex selection. For 
example, the query "Computers -type=phdthesis - 
type=mastersthesis" would be equivalent (using the 
DBLP webcase) to the query 



select * from publication 
where title like '%Computers%' 
where type <> 'phdthesis' 
and type <> 'mastersthesis' 



This is not a major saving. However, a researcher who 
is trying to find the year of publication of Brin and 




Figure 2: Architecture of DbSurfer. Closed boxes represent the external sources of data which the user is 
interested in; open boxes represent internal data stores; unshaded circles represent processes typically associated 
with search engines; shaded circles represent processes unique to DbSurfer; solid arrows represent data flow and 
dotted arrows represent flows of important information (URLs and Queries). Simple keyed "get" instructions 
(for example in HTTP requests) are omitted for clarity. 



1. (PUBLICATION) 

2. (row) 

3. (JOURNAL) Advances in Computers (/JOURNAL) 

4. (KEY) journals/ac/Dam66 (/KEY) 

5. (PAGES) 239-290 (/PAGES) 

6. (TITLE) Computer Driven Displays and Their Use in Man/Machine Interaction. (/TITLE) 

7. (TYPE) article (/TYPE) 

8. (URL) http://dblp.uni-trier.de/db/journals/ac/ac7.html#Dam66 (/URL) 

9. (VOLUME) 7 (/VOLUME) 

10. (YEAR) 1966 (/YEAR) 

11. (row) 

12. (/PUBLICATION) 



Figure 3: Example XML entry extracted from the DBLP Schema. 



Page's search engine paper [7] could find the answer 
with a query such as "sergey anatomy" , whereas the 
fuU SQL required would be: 



select year from publication, writes, author 
where lower(author.name) like 'sergey%' 
and lower(publication. title) like 'anatomy%' 
and writes. publication ~ publication. key 
and writes. author ~ author. id; 



We believe the DbSurfer expression represents a sig- 
nificant saving in time and complexity for the user, 
whilst still returning the desired result. Using Ora- 
cle's explain plan function |14| to examine the ac- 
tions of the Oracle database when performing this 
query reveals that 8 operations are required to com- 
plete this query including a full table scan. Other 
relational databases are likely to offer similar perfor- 
mance. In comparison, the DbSurfer results require no 
database accesses to compute the trails, and require 
only 3 index-only accesses to examine the relevant en- 
tries, showing that DbSurfer can provide results which 
provide savings in database activity as well as user in- 
put. 



7 Examples 

In order to highlight the differences between the 
varying keyword-based systems for indexing relational 
database content, we have followed Hulgeri's lead in in- 
dexing the content of a relational database containing 
DBLP data HH [211 . The DBLP data is downloaded as 
an XML file which we then parsed to create the schema 
shown in figure^] There arc four tables in the schema. 
The publication table (230000 rows, 300Mb) holds 
details of all the journal, article and book entries. The 
author table (150000 rows, 20Mb) contains details of 
each individual author, and the writes table (480000 
rows, 20Mb) links these together. The citation ta- 
ble (100000 rows, 13Mb) links publications with those 
which refcrence them. 

We have made the DBLP interface available to the 
public as a demonstration of DbSurfer's potential. 
This demo can be reached from the homepage for Birk- 
beck College School of Computer Science's Web Nav- 



igation Group at pittp: / / nzone.dcs.bbk.ac.uk/l Fig- 



ure[Sland figure|n|show two examples of the NavSearch 
interface used for both database and web search. 

FigurcElshows results for the query "scrgcy anatomy" . 
The first trail shows the entries for Sergey Brin and 
Brin and Page's much-cited paper "Anatomy of a 
Large Scale Hypertextual Web Search Engine" In 
this example, the remaining trails are single-node trails 



describing other authors called Sergey and other pa- 
pers with anatomy in the title. 

Figure |S1 shows results for the query "vannevar bush". 
The first trail is a singleton node showing the author 
entry for Vannevar Bush. The second shows Bush's 
paper "As we may think" P] in the context of a citation 
by a later work. The third trail shows two papers 
describing work related to Vannevar Bush and Memex, 
both by James M. Nyce. 

It should be noted that the DBLP already has a search 
system designed specifically for researcher's needs. 
The DbSurfer system cannot hope to replace all the 
functionality of a custom system or of a relational 
database. The reason for choosing the DBLP as a 
demonstration is to allow better testing and compar- 
ison with similar databases-indexing systems. How- 
ever, DbSurfer would allow the rapid deployment of 
a search and navigation interface in situation where 
no such interface exists. Secondly, DbSurfer can allow 
the development of a custom system by using XSLT 
stylesheets to format results. In many cases, missing 
features and aggregation of results can be added by 
constructing views at the database level. 



8 Evaluation 

As a preliminary evaluation into the relative perfor- 
mance of DbSurfer, we ran two experiments. These 
were performed on a server with IGHz dual Pentium 
III processors. 

In the first experiment, we selected the 20 papers found 
in the DBLP corpus, with the highest ranks in the 
Researchlndex (CiteSeer) "most accessed documents" 
list. From this we constructed 20 queries by taking the 
surname of the first author and 1, 2 or 3 significant 
keywords with which a user might expect to identify 
that paper. We submitted these queries to DbSurfer 
for evaluation. We also submitted them to compared 
BANKS (Browsing ANd Keyword Search in relational 
databases) and CiteSeer [5l] for comparison. The 
results arc shown in figure [3 The key result is that 
DbSurfer performs well (and outperforms BANKS and 
Citeseer) in finding requested references. The table 
shows reciprocal ranks for the desired paper, in terms 
of the trail, page or cluster containing the relevant ci- 
tation. Only the first page of results was considered 
in each case, but this should have minimal impact on 
the results. Times are shown as reported by each of 
the systems concerned and arc not strictly compara- 
ble, but are intended to be indicitive of the general 
level of performance. Times are also missing for those 
queries for which the BANKS system failed to return 
any results. 

This result is encouraging, but may be misleading in 
places. The poor retrieval performance of BANKS is 
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fjKEY : VARCHAR2 
jiTYPE : VARCHAR2 
jlEDITOR : VAR0HAR2 
jiTITLE : VARCHAR2 
jiBOOKTITLE : VARCHAR2 
iPAQES : VARCHAR2 
liYEAR : NUMBER 

iDDRESS : VARCHAR2 

lOURNAL : VARCHAR2 
jiVOLUME : NUMBER 
jiJOURNAL_NUMBER : VARCHAR2 
jiMONTH : VARCHAR2 
jURL : VARCHAR2 
^,EE : VARCHAR2 
(PUBLISHER : VARCHAR2 
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lISBN :OHAR 

,|SERIES_TITLE : VARCHAR2 
jSERIES_URL : VARCHAR2 
LsCHOOL : VARCHAR2 
|CHAPTER : NUMBER 
iPK = KEY 



CROSSREF = KEY 



Figure 4: UML Diagram showing the DBLP Schema. The pubhcation table stores details of all the journal, 
article and book entries, indexed by the attribute key. The citation table refers to two publication entries 
using the foreign keys cited and citing. Finally, The Author table is indexed on the primary key id, and is 
linked to the publication table by the writes table, whose foreign keys are publication, which refers to the 
key attribute in the publication table and author which refers to the id field in the author table. 
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Figure 5: Example results using DbSurfer for the query "sergey anatomy". 
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Figure 6: Example results using DbSurfer for the query "vannevar bush" . 



largely due to its poor coverage as it indexes only a 
subset of the DBLP data set. The 21.38 second re- 
sponse time for the query "nilsson routers" is due to 
bad configuration and behaviour of the garbage col- 
lector. However, a top-and-tailed average time of 1.2 
seconds is still disappointingly short of the sub-second 
response time expected. More worrying is that a third 
of queries failed to return the desired document in any 
of the returned trails. However, over half the desired 
documents where identified in the best trail for each 
query, suggesting that the trail-finding scheme can be 
highly effective. 

The second experiment provided a closer analysis of 
the times taken is computing the results. By isolat- 
ing two papers and requesting them with a decreas- 
ing number of keywords, we could analyse the times 
taken to perform each component operation. Comput- 
ing scores for nodes takes around 50% of the total pro- 
cessing time, with the trail finding taking around 30%, 
computing the text summaries around 15%, filtering 
redundant information around 2%, with the remain- 
der being taken up by system overhead, XML trans- 
formation and presentation. Increasing the number of 
keywords causes a limited increase in the time to com- 
pute page scores, but this impact is dwarfed by other 
factors. One other interesting result is that as the 
number of keywords increases so does the fraction of 
nodes in the returned trails which are distinct for the 
entire trailset. Only extensive user testing will confirm 
whether this is a positive feature. 



9 Related Work 

Recent work at Microsoft Research, at the Indian In- 
stitute of Technology, Bombay and at the University 
of California has resulted in several systems similar in 
many ways to our own. However, the system we de- 
scribe differs greatly in the design of the algorithms 
and in the style of the returned results. Our system 
also offers the opportunity for integrating both web 
site and database content with a common interface 
and for searching both transparently. 

BANKS was developed by the Indian Institute of Tech- 
nology • Each result in the BANKS systems is a 
tree from a selected node, ordered by a relevance func- 
tion which factors in node and link weights. Mragyati, 
also developed at the Indian Institute of Technology, 
uses a similar approach in which keyword queries are 
converted to SQL at query time [33]. This approach 
has some notable advantages. It guarantees that all 
data being searched on is fresh, whereas DbSurfer only 
ensures that the displayed data is fresh - the data in 
the inverted file will need to be periodically updated 
to ensure that it is not "stale" . The authors claim 
that the approach "is scalable, as it does not build 
an in-memory graph" . This is a legitimate criticism 
of DbSurfer's approach. However, allowing almost ar- 
bitary selection of attributes for querying and relying 
on the databases own indexes restricts the indexing of 
binary fields to those supported by the database (usu- 
ally in non-standard components) and makes full-table 
scans probable, introducing a new problem in scalabil- 
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Figure 7: Comparison of reciprocal rank and total time taken for 20 citation-seeking queries on DbSurfer, 
BANKS and CiteSeer. 



ity and response time. In addition, research has shown 
that large graphs (e.g. a few billion nodes) can be 
stored and manipulated in main memory of mid-range 
servers when appropriate compression techniques are 
usedHES]. 

DBXplorer |3] was developed by Microsoft Research, 
and like BANKS and Mragyati, it uses join trees to 
compute an SQL statement to access the data. The 
algorithm to compute these differs, as does the im- 
plementation, which was developed for Microsoft's IIS 
and SQL Server, the others being implemented in Java. 
DbSurfer does not require access to the database to 
discover the trails, only to display the data when user 
clicks on a link in that trail. 

DISCOVER is the latest offering and shares many sim- 
ilarities to Mragyati, BANKS and DbXplorer, but uses 
a greedy algorithm to discover the minimal joining 
network ^H]. It also takes greater advantage of the 
database's internal keyword search facilities by using 
Oracle's Context cartridge for the text indexing. 

Goldman et al. have also introduced a system for key- 
word search |12| . Their system works by finding results 
for queries of the form x near y (e.g. find movie near 
travolta cage). Two sets of entries are found - and the 
contents of the first set are returned based upon their 
proximity to members of the second set. In compari- 
son to DbSurfer, there is no support for navigation of 
the database (manual or assisted) nor any display of 
the context of the results. 



The join discovery problem is related to the problem 
tackled by the universal relation model |39| |22j . The 
idea underlying the universal relation model is to al- 
low querying the database soley through its attributes 
without explicitly specifying the join paths. The ex- 
pressive querying power of such a system is essentially 
that of a union of conjunctive queries (see [HS)- Db- 
Surfer takes this approach further by allowing the user 
to specify values (keywords) without stating their re- 
lated attributes and providing relevance based filter- 
ing. 

Goldman and Widom outline an approach for the re- 
lated problem of allowing structured database queries 
on the web WSQ/DSQ (pronounced "wisk- 

disk") is a combination of two systems for Web- 
Supported and Database-Supported Queries. WSQ al- 
lows structued queries on web data, by allowing two 
virtual tables, WebPages{SearchExp, Ti, T2-.-Tn, 
U RLRank, Date) and WebCount(SearchExp, Ti, 
T2 ■ ■ - Tn, Count), both of which can be queried along- 
side normal RDBMS tables. A similar approach to 
|13j is adopted by Squeal, which provides page, tag, 
att, link and parse tables which can be queried using 
SQL 121] . It would be possible to extend the DbSurfer 
engine to provide such functionality by adding appro- 
priate stored procedures to the database. These could 
request data from DbSurfer (using SOAP or a similar 
RPC protocol) and map the returned trail information 
to an appropriate schema, such as that described by 
Heather and Rossiter |17| . 



10 Future Work 

In addition to improving the quality of the overall re- 
sults and speed of delivery, the issue of incremental 
updates needs to be addressed. Theoretically, this 
could be achieved by storing a simple checksum of the 
database field values alongside the main index. The 
database could then be queried for those rows where 
the checksum is different. This restriction could be 
added at the database level, or in DbSurfer prior to 
construction of the virtual document. When this work 
is finished, several other key problems will still remain. 

10.1 Queries 

The current system does not handle range queries. In 
fact DbSurfer does not handle numbers very well - it 
works only using the text representation. We can im- 
prove this situation by following ideas presented in [2] . 
The system described recognizes numbers in both doc- 
uments and queries and looks for close matches. The 
same strategy can be extended to dates, by convert- 
ing all date representations to numeric values. Given 
attribute- value pairs in the inverted file, we can imple- 
ment some aggregate functions by combining values at 
query time. An alternative strategy is to index views 
created at the database level, but this requires a good 
understanding of the values which are likely to be ag- 
gregated. 

The evaluation of the system is encouraging, but lim- 
ited. In order to achieve a more comprehensive com- 
parison, we propose the creation of an independant 
test suite for database keyword-search, with a compe- 
tition run on similar lines to the TREC conference, 
perhaps as a workshop associated with a major con- 
ference. 

10.2 Presentation 

Some presentation issues exist for the row display 
servlet. Backlink handling, for example, is an issue. 
When navigating the database structure it should be 
possible to examine those rows which reference a given 
attribute. This can be achieved by using a separate 
servlet to generate the list of rows, which might op- 
erate by submitting another query with the command 
link: currenturl. This would return a list of rows 
which reference the current page's underlying row. 
With the appropriate query modification, this could 
be extended to restrict entries to the user's require- 
ments. 

Another issue is the handling of multipart keys. Each 
foreign key field is currently displayed as an outlink. 
However, this method of display does not extend to 
multipart or composite keys. In particular, it will not 
work for composite keys where one of the component 



attributes is a foreign key for some other table. In such 
a situation it is unclear were the destination of such a 
link should be. 

10.3 Security 

Security is a major issue. By constructing a single 
index we remove the fine-grained access controls em- 
ployed by the database. Since all indexing is done 
through a single user account, the access rights for 
all DbSurfer users are equivalent to the access rights 
of that user. One possible way to restore some of the 
fine-grained security may be to allow each user to view 
the data only under the database username and pass- 
word which they supply. Such a system might be im- 
plemented using container managed security which is 
part of the J2EE standard. This would require some 
very simple server configuration and a view on the data 
dictionary tables of the underlying RDBMS. However, 
this is not a complete solution as it would only af- 
fect the display servlet. We would need to expand 
this so that rows which could not be displayed were 
never presented to the user. This would have a noti- 
cable impact on performance. However, failure to do 
this would have two negative implications. Firstly, the 
system would present users with data which they could 
not access (this being analogous to returning 404s in 
a web search engine). Secondly, it might be possible 
to infer information without the rows being displayed. 
For example, if a company had an invoices table in- 
dexed, simply the presence of an entry (for example 
payee=enron or reason=takeover) might be consid- 
ered damaging. Until these issues are resolved, the ef- 
ficient indexing of secure data for unstructured search 
will be highly problematic. 

11 Concluding Remarks 

We have presented DbSurfer - a system for keyword 
search and navigation through relational databases. 
DbSurfer's unique feature is a novel join discovery al- 
gorithm which discovers Memex-like trails though the 
graph of foreign-to-primary key dependencies. Db- 
Surfer allows queries to be answered efficiently, provid- 
ing relevant results without relying on a translation to 
SQL. 
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